AITopics

Country:

Asia > Singapore (0.04)
North America > Canada (0.04)

Genre: Research Report > Experimental Study (0.67)

Industry: Education > Curriculum > Subject-Specific Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.46)

Neural Information Processing Systems | Feb-11-2026, 06:18:07 GMT

fd9dd764a6f1d73f4340d570804eacc4-Paper.pdf

constraint, solution code, symbolic execution, (13 more...)

Country:

Asia > Singapore (0.04)
North America > Canada (0.04)

Genre: Research Report (0.93)

Industry: Education > Curriculum > Subject-Specific Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language (0.68)

Neural Information Processing Systems | Dec-24-2025, 22:32:44 GMT

Synthesizing Tasks for Block-based Programming

artificial intelligence, name change, proceedings, (6 more...)

Technology: Information Technology > Artificial Intelligence (0.76)

Sharifloo, Amir Molzam, Heydari, Maedeh, Kazerooni, Parsa, Maninger, Daniel, Mezini, Mira

Where Do LLMs Still Struggle? An In-Depth Analysis of Code Generation Benchmarks

arXiv.org Artificial Intelligence | Nov-7-2025

Large Language Models (LLMs) have achieved remarkable success in code generation, and the race to improve their performance has become a central focus of AI research. Benchmarks and leaderboards are increasingly popular, offering quantitative rankings of LLMs. However, they provide limited insight into the tasks that LLMs consistently fail to solve, information that is crucial for understanding current limitations and guiding the development of more capable models. To address this gap, we examined code generation tasks across four popular benchmarks, identifying those that major LLMs are most likely to fail. To understand the causes of these failures, we first investigated whether the static complexity of solution code contributes to them, and then systematically inspected 114 tasks that LLMs consistently struggled with. Our analysis revealed four recurring patterns of weaknesses in LLMs, as well as common complications within benchmark tasks that most often lead to failure.

benchmark, large language model, machine learning, (19 more...)

2511.04355

Country: Europe > Germany > Hesse (0.15)

Genre:

Research Report > Experimental Study (0.94)
Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Neural Information Processing Systems | Aug-17-2025, 10:31:59 GMT

fd9dd764a6f1d73f4340d570804eacc4-Supplemental.pdf

constraint, machine learning, natural language, (17 more...)

Country:

Asia > Singapore (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Genre: Research Report > Experimental Study (0.67)

Industry: Education > Curriculum > Subject-Specific Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.46)

Neural Information Processing Systems | Aug-17-2025, 10:31:51 GMT

fd9dd764a6f1d73f4340d570804eacc4-Paper.pdf

artificial intelligence, machine learning, natural language, (16 more...)

Country:

Asia > Singapore (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Genre: Research Report (0.93)

Industry: Education > Curriculum > Subject-Specific Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language (0.68)

arXiv.org Artificial Intelligence | Aug-8-2025

Optimizing LLM-Based Multi-Agent System with Textual Feedback: A Case Study on Software Development

Shen, Ming, Shu, Raphael, Pratik, Anurag, Gung, James, Ge, Yubin, Sunkara, Monica, Zhang, Yi

We have seen remarkable progress in multi-agent systems powered by large language models (LLMs), which solve complex tasks that require cooperation among experts with diverse skills. However, optimizing LLM-based multi-agent systems remains challenging. In this work, we perform an empirical case study on group optimization of role-based multi-agent systems, using natural language feedback for challenging software development tasks under various evaluation dimensions. We propose a two-step pipeline for optimizing agent prompts: first, identify underperforming agents and their failure explanations using textual feedback; then, optimize the system prompts of the identified agents using those failure explanations. We then study the impact of various optimization settings on system performance with two comparison groups: online against offline optimization and individual against group optimization. For group optimization, we study two prompting strategies: one-pass and multi-pass prompting optimizations. Overall, we demonstrate the effectiveness of our optimization method for role-based multi-agent systems tackling software development tasks evaluated on diverse evaluation dimensions, and we investigate the impact of diverse optimization settings on group behaviors of the multi-agent systems to provide practical insights for future development.

agent, artificial intelligence, optimization, (15 more...)

2505.16086

Country:

North America > Mexico (0.28)
Asia > Middle East (0.28)
Europe > Austria (0.28)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.67)

arXiv.org Artificial Intelligence | May-13-2025

Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks

Xu, Kai, Mao, YiWei, Guan, XinYi, Feng, ZiLong

The application of large language models (LLMs) in the field of coding is evolving rapidly: from code assistants, to autonomous coding agents, and then to generating complete projects through natural language. Early LLM code benchmarks primarily focused on code generation accuracy, but these benchmarks have gradually become saturated. Benchmark saturation weakens their guiding role for LLMs. For example, HumanEval Pass@1 has reached 99.4% and MBPP 94.2%. Among various attempts to address benchmark saturation, approaches based on software engineering have stood out, but the saturation of existing software engineering benchmarks is rapidly increasing. To address this, we propose a new benchmark, Web-Bench, which contains 50 projects, each consisting of 20 tasks with sequential dependencies. The tasks implement project features in sequence, simulating real-world human development workflows. When designing Web-Bench, we aim to cover the foundational elements of Web development: Web Standards and Web Frameworks. Given the scale and complexity of these projects, which were designed by engineers with 5 to 10 years of experience, each presents a significant challenge. On average, a single project takes 4 to 8 hours for a senior engineer to complete. With our benchmark agent (Web-Agent), SOTA (Claude 3.7 Sonnet) achieves only 25.1% Pass@1, significantly lower (better) than SWE-Bench's Verified (65.4%) and Full (33.8%) scores. Finally, we argue that in any development field, Standards and Frameworks represent foundational knowledge and efficiency tools, respectively, and that LLMs require optimization tailored to them.
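For readers unfamiliar with the Pass@1 figures quoted above, the following is a minimal sketch of the unbiased pass@k estimator commonly used with HumanEval-style benchmarks. Whether Web-Bench computes its Pass@1 in exactly this way is an assumption, and the function names `pass_at_k` and `mean_pass_at_k` are hypothetical, chosen for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n samples, c of which passed.

    Uses the standard estimator 1 - C(n - c, k) / C(n, k): the probability
    that at least one of k samples drawn without replacement is correct.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Illustrative numbers only, not taken from any benchmark.
    per_task = [(10, 3), (10, 0), (10, 10)]
    print(f"pass@1 = {mean_pass_at_k(per_task, k=1):.3f}")
```

With a single sample per task (n = 1, k = 1), the estimator reduces to the plain fraction of tasks solved, which is the simplest reading of a reported Pass@1 percentage.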

large language model, llm code benchmark, machine learning, (16 more...)

2505.07473

Genre:

Research Report (0.64)
Workflow (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

arXiv.org Artificial Intelligence | Feb-26-2025

Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval

Wu, Jiarong, Chen, Songqiang, Cao, Jialun, Lo, Hau Ching, Cheung, Shing-Chi

Existing code generation benchmarks for Large Language Models (LLMs) such as HumanEval and MBPP are designed to study LLMs' end-to-end performance: the benchmarks feed a problem description in natural language as input and examine the generated code in specific programming languages. However, the evaluation scores obtained in this way provide little hint as to the bottleneck of code generation, namely whether LLMs are struggling with their problem-solving capability or their language-coding capability. To answer this question, we construct PseudoEval, a multilingual code generation benchmark that provides a solution written in pseudocode as input. By doing so, the bottleneck of code generation in various programming languages can be isolated and identified. Our study yields several interesting findings. For example, we identify that the bottleneck of LLMs in Python programming is problem-solving, while in Rust they struggle relatively more with language-coding. Also, our study indicates that problem-solving capability may transfer across programming languages, while language-coding needs more language-specific effort, especially for undertrained programming languages. Finally, we release the pipeline for constructing PseudoEval to facilitate its extension to existing benchmarks. PseudoEval is available at: https://anonymous.4open.science/r/PseudocodeACL25-7B74.

code generation, programming language, pseudocode, (15 more...)

2502.19149

Country:

North America > United States > California > Sacramento County > Sacramento (0.04)
North America > Canada > British Columbia > Vancouver (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(4 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing Systems | Jan-16-2025, 22:59:09 GMT

Synthesizing Tasks for Block-based Programming

Block-based visual programming environments play a critical role in introducing computing concepts to K-12 students. One of the key pedagogical challenges in these environments is in designing new practice tasks for a student that match a desired level of difficulty and exercise specific programming concepts. Our methodology is based on the realization that the mapping from the space of visual tasks to their solution codes is highly discontinuous; hence, directly mutating reference task T^in to generate new tasks is futile. Then, the algorithm performs symbolic execution over a code C^out to obtain a visual task T^out; this step uses the Monte Carlo Tree Search (MCTS) procedure to guide the search in the symbolic tree. We demonstrate the effectiveness of our algorithm through an extensive empirical evaluation and user study on reference tasks taken from the Hour of Code: Classic Maze challenge by Code.org and the Intro to Programming with Karel course by CodeHS.com.

block-based programming, synthesizing task, task task, (5 more...)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning (0.61)